First, we load a few R packages
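The chunk listing the packages is not shown; a minimal sketch of what it likely contains, assuming only the functions used later in these notes:

library(tidyverse)  # read_csv(), the dplyr verbs, and ggplot2 are all used below
## MASS is only called as MASS::mvrnorm(), so it needs to be installed but not attached
## (attaching it would mask dplyr::select()).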
Let’s say you have a set of \(n\) observations \(X_i = (X_{i1},...,X_{ip})\) for \(i = 1, \ldots, n\), each with \(p\) features. This data may or may not be “high-dimensional” (i.e. \(p \gg n\)). Depending on your goal in the analysis, you will find dimensionality reduction methods can be useful in some of the following scenarios (not mutually exclusive):
As we have seen in class, visualizing data is one of the most important parts of data science. The right visualization method may reveal problems with the data that can render the results from a standard analysis, although typically appropriate, completely useless. It can also help us make important discoveries.
We have shown methods for visualizing univariate and paired data, but visualizing relationships between columns or between rows of high-dimensional data is more complicated (e.g. what if there are strong correlations between the features?). For example, if each observation has \(p = 100\) features, comparing each pair of features would require \({p \choose 2} = 4950\) scatter plots, and creating one single scatter plot of the data is not possible due to the high dimensionality.
Methods for dimension reduction can be used to preserve important characteristics of the data, such as the distance between features or observations, but with fewer dimensions, making data visualization (or exploratory data analysis) feasible.
We will consider different methods for dimension reduction, but the main approach we will discuss is called Principal Components Analysis. The goal here is to find a new set of multivariate variables that are uncorrelated and explain as much variance as possible.
For our purposes, EDA is the primary reason we will be using dimensionality reduction methods for this lecture.
Here we are interested in finding the lower-rank matrix (i.e. one built from fewer variables) that best approximates the original data.
In the figure below, we see that using only the first principal component really just captures the difference between light and dark. As you increase the number of principal components, you see more detail in the image.
If you go far enough, you will recover an image that looks very similar to the original, but that uses far fewer dimensions than the original dataset.
We will not get to this until term 2, so stay tuned!
Here we use a data set with twin heights. We simulate \(n\)=100 two-dimensional (or \(p\)=2 features) points that represent heights of a pair of twins. For our purposes, let’s say we assume the number of features (\(p\)=2) is too large (or too high-dimensional) and we want to reduce the dimensions to \(p\)=1.
First, let’s simulate the heights of pairs of twins.
set.seed(100)
n <- 100
lim <- c(60, 78)
## simulate n pairs of twin heights from a bivariate normal distribution
## with mean 69 inches, variances 9, and correlation 0.92
X <- MASS::mvrnorm(n, c(69, 69),
                   matrix(c(9, 9*0.92, 9*0.92, 9*1), 2, 2))
head(X)
##          [,1]     [,2]
## [1,] 67.72362 67.32411
## [2,] 68.56875 70.20449
## [3,] 69.04952 68.48654
## [4,] 71.10088 72.11233
## [5,] 70.21862 68.46903
## [6,] 70.17676 69.69639
Here we will motivate methods for dimensionality reduction by showing a transformation that permits us to approximate the distance between two-dimensional points using just one dimension.
Consider the first two observations (i.e. first rows) in our dataset X (red points).
plot(X, xlab = "twin height 1",
     ylab = "twin height 2")
points(X[1:2,], col="red", pch=16)
lines(X[1:2,], col="red")

The distance between those two points is:
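The chunk that computes this distance is not shown; a sketch of the calculation, which gives the value printed below:

sqrt(sum((X[1,] - X[2,])^2))  # Euclidean distance between the first two observations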
## [1] 3.001808
Or we can calculate the Euclidean distance between all observations pairwise using the dist() function.
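The call itself is not shown; presumably something along these lines, with a small corner of the distance matrix printed below:

d <- dist(X)            # all pairwise Euclidean distances
as.matrix(d)[1:5, 2:5]  # first few rows and columns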
## 2 3 4 5
## 1 3.001808 1.763315 5.859435 2.745157
## 2 0.000000 1.783949 3.170413 2.394554
## 3 1.783949 0.000000 4.165861 1.169231
## 4 3.170413 4.165861 0.000000 3.748604
## 5 2.394554 1.169231 3.748604 0.000000
Note, if we center the data by removing the average from both columns, the distances between the pairs of twins do not change.
Here we use the scale() function, which has center and scale arguments, to remove the column means; we do not divide by the standard deviations here, so the distances are preserved. You can also try the sweep() function for just centering or just scaling.
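A sketch of the centering step (the original chunk is not shown); note we only center, so the pairwise distances printed below are unchanged:

X_centered <- scale(X, center = TRUE, scale = FALSE)  # hypothetical name; subtract the column means
as.matrix(dist(X_centered))[1:5, 2:5]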
## 2 3 4 5
## 1 3.001808 1.763315 5.859435 2.745157
## 2 0.000000 1.783949 3.170413 2.394554
## 3 1.783949 0.000000 4.165861 1.169231
## 4 3.170413 4.165861 0.000000 3.748604
## 5 2.394554 1.169231 3.748604 0.000000
Ok, let’s start with the naive approach of simply removing one of the two dimensions. Let’s compare the actual distances to the distances computed with just one of the dimensions. The plot below shows the comparison using the first dimension (left) and the second (right).
par(mfrow=c(1,2))
Z <- X[,1]
plot(dist(X), dist(Z),
xlab = "actual distance",
ylab = "Using only first X column")
abline(0,1, col = 2)
Z <-X[,2]
plot(dist(X), dist(Z),
xlab = "actual distance",
ylab = "Using only second X column")
abline(0,1, col = 2)

We see the actual distance is generally underestimated when using only 1 of the 2 columns of heights. This is to be expected since the actual distance sums the squared differences over both dimensions. If we instead average over the dimensions and use this distance,
\[d_{ij} = \sqrt{ \frac{1}{2} \sum_{p=1}^2 (X_{i,p}-X_{j,p})^2 }\] then we see the bias goes away
Z <- X[,1]
par(mfrow=c(1,1))
plot(dist(X)/sqrt(2), dist(Z),
xlab = "actual distance",
ylab = "Using only first X column with scaling factor")
abline(0,1, col = 2)

We also see there is a strong correlation. Can we pick a one-dimensional summary that makes this correlation even stronger?
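The correlation printed below was presumably computed with something like:

cor(dist(X), dist(Z))  # Z currently holds only the first column of X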
## [1] 0.97286
If we look back at the plot and visualize a line between any pair of points, the length of this line is the distance between the two points. These lines tend to run along the direction of the diagonal. Notice what happens if we instead plot the average and the difference of the two heights:
avg <- rowMeans(X) ## or (X[,1] + X[,2])/2
diff <- X[,2] - X[,1]
Z <- cbind( avg, diff)
par(mfrow=c(1,2))
lim=c(-9,9)
plot(X, xlim=lim, ylim=lim,
main = "Twin height scatterplot",
xlab = "Twin height 1",
ylab = "Twin height 2")
points(X[1:2,], col="red", pch=16)
lines(X[1:2,], col="red")
plot(Z, xlim=lim, ylim=lim,
xlab = "Mean", ylab = "Difference")
points(Z[1:2,], col="red", pch=16)
lines(Z[1:2,], col="red")

We see that almost all the variation can be explained by the Mean axis. This means we can essentially ignore the second (Difference) dimension and not lose too much information. If the points fell on a completely flat line, we would lose no information at all. If we use this transformation of the data instead, we get a much higher correlation:
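Again, the chunk is not shown; presumably something like:

cor(dist(X), dist(Z[,1]))  # distances based only on the "Mean" column of Z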
## [1] 0.9973466
Note that each row of \(X\) was transformed using a linear transformation.
If you are familiar with linear algebra we can write the operation we just performed like this:
\[ Z = X A \mbox{ with } A = \, \begin{pmatrix} 1/2&-1\\ 1/2&1\\ \end{pmatrix} \]
And that we can transform back by simply multiplying by \(A^{-1}\) as follows:
\[ X = Z A^{-1} \mbox{ with } A^{-1} = \, \begin{pmatrix} 1&1\\ -1/2&1/2\\ \end{pmatrix} \]
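As a quick check (not part of the original notes), we can carry out this matrix multiplication in R; A and Z2 are hypothetical names:

A <- matrix(c(1/2, 1/2, -1, 1), nrow = 2)  # columns hold the "mean" and "difference" weights
Z2 <- X %*% A
max(abs(Z2 - Z))               # essentially zero: the same Z we computed above
max(abs(Z2 %*% solve(A) - X))  # multiplying by the inverse recovers X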
Note: we can actually guarantee that the distances remain exactly the same if we re-scale the columns of \(A\) to assure that the sum of squares of each column is 1:

\[a_{1,1}^2 + a_{2,1}^2 = 1 \mbox{ and } a_{1,2}^2 + a_{2,2}^2 = 1\]

and the columns of \(A\) are orthogonal to each other. In this particular example, to achieve this we multiply the first set of coefficients (first column of \(A\)) by \(\sqrt{2}\) and the second by \(1/\sqrt{2}\). Then we get the exact same distances if we use both dimensions, and a very good approximation if we use only the first.
avg <- rowMeans(X)*sqrt(2)      ## i.e. (X[,1] + X[,2])/sqrt(2)
diff <- (X[,2] - X[,1])/sqrt(2)
Z <- cbind( avg, diff)
par(mfrow=c(1,2))
plot(dist(X), dist(Z),
xlab = "actual distance",
ylab = "Using both columns in Z")
abline(0,1)
plot(dist(X), dist(Z[,1]),
xlab = "actual distance",
ylab = "Using only first Z column")
abline(0,1)

In this case \(Z\) is called an orthogonal rotation of \(X\): it preserves the distances between points.
Let’s formalize this a bit more.
The singular value decomposition (SVD) is a generalization of the algorithm we used in the motivational section. As in the example, the SVD provides a transformation of the original data. This transformation has some very useful properties.
The main result the SVD provides is that we can write any matrix \(\mathbf{X}\) (of dimension \(m \times n\)) as
\[\mathbf{U}^\top\mathbf{X} = \mathbf{DV}^\top\]
with:

\(\mathbf{U}\) an \(m \times p\) matrix with orthonormal columns (\(\mathbf{U}^\top\mathbf{U} = \mathbf{I}\)),

\(\mathbf{D}\) a \(p \times p\) diagonal matrix with non-negative, decreasing diagonal entries \(d_1 \geq d_2 \geq \cdots \geq d_p \geq 0\),

\(\mathbf{V}\) an \(n \times p\) matrix with orthonormal columns (\(\mathbf{V}^\top\mathbf{V} = \mathbf{I}\)),

and \(p=\mbox{min}(m,n)\). \(\mathbf{U}^\top\) provides a rotation of our data \(\mathbf{X}\) that turns out to be very useful because the variability of the rows of \(\mathbf{U}^\top \mathbf{X}\) (or \(\mathbf{DV}^\top\)) is decreasing. Because \(\mathbf{U}\) is orthogonal, we can write the SVD like this:
\[\mathbf{X} = \mathbf{UDV}^\top\]
Now we will apply the SVD to the motivating example using the svd() function in R:
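The preprocessing chunk is not shown; based on the comparison with prcomp() later in these notes, it presumably centers and scales the columns and then transposes so that the \(p=2\) features are along the rows:

Y <- t(scale(X, center = TRUE, scale = TRUE))  # assumption: same standardization used later
dim(Y)  # 2 features (rows) by 100 observations (columns)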
## [1] 2 100
Note, we transposed X to put the \(p=2\) features along the rows and the \(n=100\) observations along the columns before applying svd().
The svd() command returns the three matrices; for \(\mathbf{D}\), only the diagonal entries are returned, as a vector.
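A sketch of the call that produces the structure printed below:

s <- svd(Y)
str(s)  # d: singular values; u: left singular vectors; v: right singular vectors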
## List of 3
## $ d: num [1:2] 13.9 2.19
## $ u: num [1:2, 1:2] -0.707 -0.707 -0.707 0.707
## $ v: num [1:100, 1:2] 0.0499 -0.01324 0.00826 -0.08742 -0.01062 ...
To obtain the principal components from the SVD, we simply need the rows of the rotation \(\mathbf{U}^\top\mathbf{X}\):
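A sketch of this computation (PC is a hypothetical name):

PC <- t(s$u) %*% Y  # each row holds the scores for one principal component
dim(PC)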
## [1] 2 100
We can plot the \(n=100\) twin heights along the principal components
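A sketch of such a plot, using the PC matrix from the previous sketch:

plot(PC[1,], PC[2,], xlab = "PC1", ylab = "PC2")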
Alternatively, we could have used \(\mathbf{DV}^\top\):
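That is (again a sketch, not the original chunk):

PC2 <- diag(s$d) %*% t(s$v)  # hypothetical name; equals t(s$u) %*% Y up to floating point error
max(abs(PC2 - PC))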
Another way of calculating the PCs is to use the prcomp() function.
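The exact call is not shown; something along these lines produces the structure printed below (the near-zero $center values suggest the X used here had already been column-centered):

pc <- prcomp(X, center = TRUE, scale = TRUE)
str(pc)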
## List of 5
## $ sdev : num [1:2] 1.4 0.22
## $ rotation: num [1:2, 1:2] 0.707 0.707 0.707 -0.707
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:2] "PC1" "PC2"
## $ center : num [1:2] -8.40e-18 1.97e-17
## $ scale : num [1:2] 3.1 2.97
## $ x : num [1:100, 1:2] -0.694 0.184 -0.115 1.215 0.148 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:2] "PC1" "PC2"
## - attr(*, "class")= chr "prcomp"
The output pc$x contains the principal components
The PC scores from prcomp() match the right singular vectors, up to scaling by the singular values, when the data are centered and scaled (the signs of PCs are arbitrary).
Y <- t(scale(X, center=TRUE, scale=TRUE))
s <- svd(Y)
pc.svd <- diag(s$d) %*% t(s$v)
pc.prcomp <- prcomp(X, center=TRUE, scale=TRUE)

plot(pc.prcomp$x[,1], -1*s$v[,1],
     xlab = "PC1 using prcomp",
     ylab = "Right singular vector 1")

plot(pc.prcomp$x[,1], -1*t(pc.svd)[,1],
     xlab = "PC1 using prcomp",
     ylab = "PC1 using svd"); abline(0,1)

Cool! So we have now reduced the \(p=2\) dimensions to a single dimension that explains most of the variation.
OK, but how is this useful?
I will admit in our super simple example, it is not immediately obvious how incredibly useful the SVD can be. So let’s consider an example with more than two features. In this example, we will greatly reduce the dimension of \(\mathbf{V}\) and still be able to reconstruct \(\mathbf{X}\).
When you write and mail a letter, the first thing that happens when it is received at the post office is that it is sorted by zip code:
Originally, humans had to sort these, but today, thanks to machine learning algorithms, a computer can read zip codes. The data is available in the data repository on the course github page.
if(!exists("digits")) {
  digits <- read_csv("../data/hand-written-digits-train.csv")
}
digits <- digits %>% sample_n(5000)            # take a random sample for purposes of the lecture
X <- digits %>% select(-label) %>% as.matrix() # save just pixels
dim(X)
## [1] 5000 784
Here are three images of written digits.
tmp <- lapply( c(1:3), function(i){
    expand.grid(Row=1:28, Column=1:28) %>%
      mutate(id=i, label=digits$label[i],
             value = unlist(digits[i,-1])) })
tmp <- Reduce(rbind, tmp)

tmp %>% ggplot(aes(Row, Column, fill=value)) +
    geom_raster() +
    scale_y_reverse() +
    scale_fill_gradient(low="white", high="black") +
    facet_grid(.~label)

Each image is converted into \(28 \times 28\) pixels, and for each pixel we obtain a grey scale intensity between 0 (white) and 255 (black). This means one image has 784 (= 28*28) features.
We can see these values like this:
tmp %>% ggplot(aes(Row, Column, fill=value)) +
    geom_point(pch=21, cex=2) +
    scale_y_reverse() +
    scale_fill_gradient(low="white", high="black") +
    facet_grid(.~label)

So for each digit \(i\) we have an outcome \(Y_i\), which can be one of 10 categories: \(0,1,2,3,4,5,6,7,8,9\), and the features \(X_{i,1}, \dots, X_{i,784}\), which can take values from 0 to 255. We use bold face to denote this vector of predictors: \(\mathbf{X}_i = (X_{i,1}, \dots, X_{i,784})\).
784 features is a lot of features. I’m going to assume that pixels close together are somewhat correlated. Is there any room for data reduction? Can we identify a reduced set of features that explains the variation in the data?
If you recall, the first PC will explain the most variation, the second PC will explain the second most variation in the data, and so on.
Because the pixels are so small, we expect pixels that are close to each other on the grid to be correlated, meaning that dimension reduction should be possible.
Let’s take the SVD of \(\mathbf{X}\). We only read in 5000 observations, so this shouldn’t take very long, but if you do this on the entire dataset, this will take a little while.
## [1] 5000 784
Remember, we need to column-center the data. We will also create a new variable \(\mathbf{Y}\) to represent the centered data, transposed so that the features are along the rows.
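A sketch of this step (not the original chunk); we only center, since some pixels have zero variance and cannot be scaled:

Y <- t(scale(X, center = TRUE, scale = FALSE))  # center the columns, then put features along the rows
dim(Y)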
## [1] 784 5000
Now apply the svd() function to \(\mathbf{Y}\).
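Presumably something like the following, which returns the structure printed below:

s <- svd(Y)
str(s)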
## List of 3
## $ d: num [1:784] 40536 35266 32786 29761 29230 ...
## $ u: num [1:784, 1:784] -2.78e-17 -1.11e-16 0.00 0.00 -1.11e-16 ...
## $ v: num [1:5000, 1:784] -0.01161 0.00853 -0.02516 -0.01806 -0.01299 ...
First note that we can in fact reconstruct \(\mathbf{Y}\) using all the PCs:
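A sketch of the reconstruction check (Yhat and resid are hypothetical names here):

Yhat <- s$u %*% diag(s$d) %*% t(s$v)
resid <- Y - Yhat
max(abs(resid))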
## [1] 3.928804e-10
If we look at the singular values in \(\mathbf{D}\), we see that the last few are quite close to 0.
This implies that the last columns of \(\mathbf{V}\) have a very small effect on the reconstruction of \(\mathbf{X}\). To see this, consider the extreme example in which the last entry of \(\mathbf{D}\) is 0. In this case the last column of \(\mathbf{V}\) is not needed at all.

Because of the way the SVD is constructed, the columns of \(\mathbf{V}\) have less and less influence on the reconstruction of \(\mathbf{X}\). You commonly see this described as “explaining less variance”. This implies that for a large matrix, by the time you get to the last columns, it is possible that there is not much left to “explain”.
As an example, we will look at what happens if we remove the last 100 columns:

k <- ncol(s$v) - 100
Yhat <- s$u[,1:k] %*% diag(s$d)[1:k,1:k] %*% t(s$v[,1:k])
resid <- Y - Yhat
max(abs(resid))
## [1] 3.928804e-10

The largest residual is practically 0, meaning that Yhat is practically the same as Y, yet we need 100 fewer dimensions to transmit the information.
By looking at \(\mathbf{D}\), we can see that, in this particular dataset, we can obtain a good approximation keeping only a subset of the columns. The following plots are useful for seeing how much of the variability is explained by each principal component:
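The plot itself is not reproduced here; a sketch of how the variance explained by each PC could be plotted:

plot(s$d^2 / sum(s$d^2), xlab = "PC", ylab = "Proportion of variance explained")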
We can also make a cumulative plot:
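A sketch of the cumulative version:

plot(cumsum(s$d^2) / sum(s$d^2), xlab = "PC",
     ylab = "Cumulative proportion of variance explained")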
Although we start with 784 dimensions, we can approximate \(\mathbf{X}\) well with far fewer:
k <- 100 ## out of a possible 784
Yhat <- s$u[,1:k] %*% diag(s$d)[1:k,1:k] %*% t(s$v[,1:k])
resid <- Y - Yhat

Therefore, by using only 100 of the 784 dimensions, we retain most of the variability in our data:
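The proportion below was presumably computed from the residuals, along these lines:

1 - sum(resid^2) / sum(Y^2)  # proportion of the total variability retained by the first k PCs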
## [1] 0.9172635
We say that we explain 92 percent of the variability in our data with 100 PCs.
Note that we can compute this proportion from \(\mathbf{D}\):
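That is, with k = 100 as above, presumably:

sum(s$d[1:k]^2) / sum(s$d^2)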
## [1] 0.9172635
The entries of \(\mathbf{D}\) therefore tell us how much each PC contributes in terms of variability explained.
Another way of calculating the PCs is to use the prcomp() function.
The cumulative proportion of variance explained by the first ten PCs is quite high (almost 50%):
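The call is not shown; a sketch that produces a table like the one below (no scaling, since some pixels are constant):

pc <- prcomp(X)  # centers the columns by default
summary(pc)$importance[, 1:10]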
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 573.32405 498.77957 463.70575 420.92107 413.40976
## Proportion of Variance 0.09574 0.07246 0.06263 0.05160 0.04978
## Cumulative Proportion 0.09574 0.16819 0.23082 0.28242 0.33220
## PC6 PC7 PC8 PC9 PC10
## Standard deviation 384.74711 336.96599 313.63881 308.84477 289.24143
## Proportion of Variance 0.04311 0.03307 0.02865 0.02778 0.02437
## Cumulative Proportion 0.37531 0.40839 0.43704 0.46482 0.48918
We can also plot the standard deviations:
or the more common plot of the variance explained:
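The plots themselves are not reproduced here; a sketch of both:

par(mfrow = c(1, 2))
plot(pc$sdev, xlab = "PC", ylab = "Standard deviation")
plot(pc$sdev^2 / sum(pc$sdev^2), xlab = "PC", ylab = "Proportion of variance explained")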
We can also see that the first two PCs are in fact quite informative. Here is a plot of the first two PCs:
data.frame(PC1 = pc$x[,1], PC2 = pc$x[,2],
           label = factor(digits$label)) %>%
    ggplot(aes(PC1, PC2, fill=label)) +
    geom_point(cex=3, pch=21)

We can also “see” the linear combinations on the grid to get an idea of what is getting weighted:
tmp <- lapply( c(1:4, 781:784), function(i){
    expand.grid(Row=1:28, Column=1:28) %>%
      mutate(id=i, label=paste0("PC",i),
             value = pc$rotation[,i])
})
tmp <- Reduce(rbind, tmp)

tmp %>% filter(id<5) %>%
    ggplot(aes(Row, Column, fill=value)) +
    geom_raster() +
    scale_y_reverse() +
    facet_wrap(~label, nrow = 1)

tmp %>% filter(id>5) %>%
    ggplot(aes(Row, Column, fill=value)) +
    geom_raster() +
    scale_y_reverse() +
    facet_wrap(~label, nrow = 1)

The main difference between PCA and factor analysis is that PCA identifies a linear combination of the observed variables, while factor analysis is a measurement model for a latent variable. This latent variable cannot be directly measured with a single variable (e.g. intelligence, social anxiety, soil health). Instead, it is seen through the relationships it causes in a set of \(X\) variables. Factor analysis is best suited for situations where we have a highly correlated set of variables: it divides the variables into groups based on their correlations and represents each group with a factor, aiming to identify latent sources of variation. It is also typically tied to Gaussian distributions.
There are a ton of packages for factor analysis, but some you can try are the factor.pa() function in the psych R package, or factanal() in the stats R package.
While the goal of PCA is to find an orthogonal linear transformation that maximizes the variance of the variables, the goal of ICA is to find a linear transformation in which the basis vectors are statistically independent and non-Gaussian. Unlike PCA, the basis vectors in ICA are neither orthogonal nor ranked in order; all components are equally important. If you can think of your data as a mix of signals, then the ICA basis will have a vector for each independent signal.
You can try ICA using the fastICA R package.
If there are complex, non-linear (e.g. polynomial) relationships between your features, you might try non-linear dimensionality reduction methods.

One such method is ISOMAP, which we use when the data are strongly non-linear.
You can try ISOMAP using the Isomap() function in the RDRToolbox R package.
t-SNE often provides better data visualizations than PCA, because a linear method struggles with modeling curved manifolds. It focuses on preserving the distances between nearby data points rather than the distances between widely separated data points.
You can try t-SNE using the Rtsne R package.
UMAP also works well for high-dimensional data, and its run-time is shorter compared to t-SNE.
You can try UMAP using the umap R package.